Abstract

We describe a prototype system for multilingual gisting of Web pages, and present an evaluation methodology based on the notion of gisting as decision support. This evaluation paradigm is straightforward, rigorous, permits fair comparison of alternative approaches, and should easily generalize to evaluation in other situations where the user is faced with decision-making on the basis of information in restricted or alternative form.

Introduction: Gisting as Decision Support

The word “gisting” has been used in a variety of settings. Informally, it simply means “getting the gist,” that is, given some information conveyed by natural language, understanding some characteristic or important aspect of that information.

By definition, gisting is an activity in which the information taken into account is less than the full information content available. In this paper, we take the view that there is another key aspect of gisting that goes beyond simply selecting a subset of available information, namely the goal of supporting decision-making. In an environment where human beings are attempting to gist radio traffic, for example, radio operators need to decide whether or not to route information to electronic warfare analysts (Elsaesser 1996). Accordingly, in order to evaluate a particular method for gisting, one must examine the extent to which gisting supports a decision-making task.

The focus of this paper is multilingual gisting on the World Wide Web, with particular attention to developing a methodology for evaluating multilingual gisting based on its role of decision support. We see such an evaluation methodology as important because, although the real proof of any method is in how well it supports real users at their real-world tasks, studying users in fully natural settings can be difficult to organize, and, more important, two natural settings are rarely similar enough to afford a fair comparison between alternative approaches to the same task. In order to address that problem, the methodology we propose is applicable to a wide variety of tasks, simple to carry out and, most important, defined in enough detail that competing methods can be evaluated against the same set of data and the results compared.

Figure 1: Portion of a page from Nihongo Yellow Pages

Gisting for the Web: A Simple Prototype

The motivation for this line of research can be described quite simply. Imagine that you are browsing the World Wide Web using your favorite Web browser. You click a link, or conduct a search, and find yourself looking at the page illustrated in Figure 1. As it happens, you don’t know a word of Japanese. What are your options? Is it worth finding a bilingual dictionary and looking up words on this page? (And if so, which words?) Is it worth following links on this page in hopes of finding something understandable? (And if so, which links?) Is it worth bothering a nearby colleague who knows the language and asking for a rough translation? Is it worth going to the time and expense of using an online service (e.g. The Global Translation Alliance, http://www.aleph.com) to translate the page completely?

In  considering  possible  solutions  to  this  scenario, 
we  arrived  at  the  following  principles. 

Avoid full-scale machine translation. The user’s problem would certainly be solved by a fully automatic translation of the Web page under consideration. Unfortunately, the state of the art in high quality machine translation is typically measured in words per minute rather than pages per minute (Dorr 1996), so even if it is possible to obtain a translation for the page, the user is still faced with the decision of whether or not it is worth sacrificing the time to obtain it.

Keep the human in the loop. We see the problem scenario as an opportunity for collaboration between person and machine, and in particular an opportunity for the machine to assist the user in doing things that people do well. For example, people are capable of disambiguating words almost effortlessly in context, although this is a task at which computers currently perform quite poorly; therefore it makes sense to have the computer present alternatives rather than making disambiguation decisions for itself, unless such decisions can be made with very high confidence.

Aim for extensibility. Our emphasis is on modular and distributed design; for example, although we do not attempt to disambiguate words in order to automatically select meaning equivalents in the user’s language, a disambiguation component could easily be added to the system without wholesale changes in its design. An ultimate target of our efforts is the dissemination of application programmer interfaces (APIs) that will make extensible infrastructure available to the community at large.

With those principles in mind, we implemented a prototype gisting proxy, which assists users when confronted with a Web page in an unknown language.^1 When invoked for a given Web page, the gisting proxy behaves as follows:

^1 For the moment we are glossing over who invokes the gisting proxy, and how. In its full generality, this proxy is part of a general design for a multilingual agent that is aware of the user’s linguistic knowledge and preferences, and goes into action when it detects a situation where its capabilities might assist the user. For the current prototype, we have implemented a gisting proxy HTTP server initiated by the user.


1.  Convert  the  character  encoding  of  the  document 
into  a  standard  encoding. 

2.  Divide  the  Web  page  into  structurally  distinct 
pieces,  using  HTML  markup. 

3.  For  each  piece: 

(a)  Automatically  identify  the  natural  language  in 
which  this  piece  of  text  is  written 

(b)  Invoke  language-dependent  word  identification 
and  normalization 

(c)  Look up each word in an on-line bilingual dictionary

(d)  Present  word-by-word  glosses  in  the  context  of 
the  original  page 

4.  Modify all links on the page so that further navigation from this point on will automatically go through the gisting proxy.
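Step 4 of the procedure above can be illustrated with a minimal sketch. The proxy address (`gist-proxy.example`), the `url` query parameter, and the function name `rewrite_links` are hypothetical and are not taken from the prototype. The regular expression handles only double-quoted `href` attributes; a real proxy would more likely use a full HTML parser.

```python
import re
from urllib.parse import quote, urljoin

# Hypothetical proxy endpoint, used for illustration only.
PROXY_PREFIX = "http://gist-proxy.example/gist?url="

# Matches double-quoted href attributes only (a simplifying assumption).
_HREF = re.compile(r'href="([^"]*)"', re.IGNORECASE)


def rewrite_links(html: str, page_url: str,
                  proxy_prefix: str = PROXY_PREFIX) -> str:
    """Point every href on the page at the proxy.

    Relative links are first resolved against the URL of the page being
    gisted, then percent-encoded and appended to the proxy prefix.
    """
    def _substitute(match: "re.Match[str]") -> str:
        absolute = urljoin(page_url, match.group(1))
        return 'href="' + proxy_prefix + quote(absolute, safe="") + '"'

    return _HREF.sub(_substitute, html)
```

With links rewritten this way, each click leads back to the proxy, which can fetch and gist the target page in turn.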

Step 1 is necessary because different character encodings can be used for the same language, particularly in the case of Asian languages (e.g. EUC-JP vs. Shift-JIS). Normalization of character encoding is necessary for consistency across components of the system.

Step 2 makes it possible to analyze documents containing text in multiple languages. Small subdocument units (e.g. list items) motivate taking an approach to automated language identification (Step 3a) that can work well even when the strings to be identified are very short and cannot be relied upon to contain function words (Dunning 1994).

Depending on the language, different measures must be taken in order to identify words (Step 3b). For example, in many Asian languages words are typically not delimited by spaces, and therefore automatic word segmentation is necessary (Matsumoto 1995). This contrasts with Romance languages such as Spanish, where words are generally delimited by spaces or punctuation but a small subset of the lexical items in the language must be identified and separated out (e.g. Spanish damelo = da + me + lo). In addition, some form of normalization may need to be done as well. For example, in order to locate da in a Spanish-English translation lexicon it may be necessary to look it up by its root form, dar (to give).

Word-by-word lookup and presentation in this system (Steps 3c and 3d) resemble the direct lexical approach to machine translation investigated and thoroughly criticized in the 1960s (ALPAC 1966). Notably, however, the problem attacked by those early efforts was one of full scale translation, not gisting. We would contend that with the rise of the World Wide Web, those early solutions have finally found the right problem.

In the current prototype, presentation of the known-language glosses for a word is guided by the results of the dictionary lookup. At present:


•  If the unknown-language word has a single gloss in the dictionary, show that gloss.

•  If the unknown-language word has multiple glosses in the dictionary, show up to n of them for some customizable parameter n (currently n = 3 by default), within parentheses and separated by commas. For example, (doctor’s office, clinic, dispensary).

•  If the unknown-language word is not found in the dictionary, then

   -  Show the unknown-language word itself, if the character set of the language is the same as a language the user knows (e.g. an unknown word in French would be shown to someone who knows English, since both use the Latin-1 character set).

   -  Show an ellipsis (...) otherwise.

This treatment of words not appearing in the dictionary follows the general principle that users should be given information that might be helpful (such as possible cognates) but minimally distracted by unfamiliar scripts. The present implementation reflects two extremes for unknown words, namely presenting them as-is or leaving them out entirely, but other strategies are possible.
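The presentation rules described above are simple enough to sketch in a few lines. The following is an illustration under stated assumptions, not the prototype’s code: the function name `render_gloss`, the dictionary representation (a mapping from a source word to a list of glosses), and the `same_script` flag are hypothetical, and the entries in the usage example are toy data.

```python
from typing import Dict, List


def render_gloss(word: str,
                 dictionary: Dict[str, List[str]],
                 max_glosses: int = 3,
                 same_script: bool = False) -> str:
    """Return the text displayed for one unknown-language word.

    max_glosses plays the role of the parameter n in the text;
    same_script says whether the word is written in a character set
    that the user can already read.
    """
    glosses = dictionary.get(word, [])
    if len(glosses) == 1:
        # Single gloss: show it as-is.
        return glosses[0]
    if len(glosses) > 1:
        # Several glosses: up to n, in parentheses, comma-separated.
        return "(" + ", ".join(glosses[:max_glosses]) + ")"
    # Not in the dictionary: show the word itself only if the script
    # is familiar to the user, otherwise an ellipsis.
    return word if same_script else "..."


# Toy dictionary, for illustration only.
toy = {
    "医院": ["doctor's office", "clinic", "dispensary", "infirmary"],
    "大阪": ["Osaka"],
}
print(render_gloss("医院", toy))
print(render_gloss("大阪", toy))
print(render_gloss("bonjour", {}, same_script=True))
```

Keeping the lookup result as a plain list leaves room for the "other strategies" mentioned above, for example transliterating unknown words instead of eliding them.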

Figure 2 shows the result of following this process for the page in Figure 1. For comparison, Figure 3 shows the same entries as they appear in an English version of the same business directory.

Our current implementation of the prototype handles gisting from Japanese, French, and Spanish to English, though in this paper we concern ourselves only with Japanese-English gisting. Given the simplicity of the approach, the main limiting factor in adding more languages to the list is the availability of bilingual dictionaries, though we expect that this problem may be ameliorated to some extent by automatic algorithms for acquisition of bilingual lexicons (Melamed 1996).

Evaluation  Design  Criteria 

The gisted text that appears in Figure 2 bears little resemblance to an English translation of the Japanese content in Figure 1. However, it does provide enough information to support two critical decisions facing the user who has arrived at that page:

•  Deciding whether a link is worth following

•  Deciding whether some text is worth having translated

A user interested in, say, podiatrists, can discern from the gisted text in Figure 2 that the first entry in the Health category is probably not worth navigating further. Similarly, someone interested in medical equipment manufacturers might well decide that the third entry is worth translating, especially if they have a particular interest in companies in Osaka.

The central issue of this paper is how to evaluate the extent to which a gisting method helps the user to make decisions of this kind. In designing a methodology for answering that question, we were guided by the following criteria:

Approximate real Web-based decision tasks. Since we have characterized the role of gisting in terms of decision support, what must be evaluated is the extent to which gisted material facilitates decisions that resemble the choices available to the user when faced with multilingual content on a Web page. This consideration led us to select a categorization paradigm, since both real-world tasks involve a tradeoff between the time invested in assessing relevance and the accuracy of the decision, as well as the need to select an appropriate action based on that assessment.

Minimize a priori biases. Users seeking information on the Web are seldom given a pithy description of a topic by someone else. Therefore it is important, in designing the experimental task, to allow users to form their own internal characterization of a topic or category, rather than pre-assigning category labels that incorporate the experimenters’ perceptions or biases.

Make the task easy to create. It is hoped that the methodology proposed here can serve as an outline for other experimenters investigating multilingual gisting, spoken language gisting, translation, summarization, and related topics. Therefore we aim for an experimental design that requires little in the way of specialized apparatus, preparation, and the like.

The experimental design, adopting these criteria, is relatively straightforward. We define a task in which all subjects are faced with the same categorization problem, but some of those subjects are given materials in English to categorize while other subjects are given the same content to categorize but in the form of gisted text. If the subjects given gisted materials make similar decisions to the subjects given the English materials (allowing for normal variability in people’s judgments), we can conclude that the quality of the gisting is reasonable. The next section gives the details of the experiment, including a way to assess the results quantitatively.

Evaluation  Study 

Materials 

Experimental items were selected from the Nihongo Yellow Pages (NYP), a business directory site on the World Wide Web (Nihongo Yellow Pages 1996). The site was chosen because it contains information across a variety of topic areas, because each business directory listing consists of a concise and informative description, and because most listings are available in both Japanese and English. In our experiments we used listings from NYP’s Education, Finance, What’s New, Entertainment, and Health categories, selecting a total of 73 business listings at random from those areas.

For each of these listings we created a 3 x 5-inch index card with a business advertisement in English and a corresponding card with a “gisted” version of the same content as expressed in Japanese. By way of illustration, Figure 3 shows three items in English, with their corresponding gisted items appearing in Figure 2.

Figure 2: Gisted items from Nihongo Yellow Pages

Figure 3: Corresponding English items from Nihongo Yellow Pages

Procedure 

Creating Topical Categories. In order to create topical categories in an objective way, we randomly selected 32 of the 73 English cards and gave them to 3 different subjects,^2 with instructions to sort the cards “into 4-6 piles of roughly equal size, placing cards in the same pile when you think they should ‘go together’, for example because they are related to similar topics.” One subject created 4 piles, another 6, and the third 7 piles. We chose the 6 piles created by the second subject as defining the topical categories for the remainder of the study, noting that the topic distinctions made by the three subjects were qualitatively similar overall.^3

Categorization Task: The Control Condition. A set of 6 subjects participated in the control condition of the experiment. The procedure had two parts (see Figure 4).

1.  First, subjects were presented with the 6 piles of English cards created as described above. They were asked to read through each pile and decide “what you think each one is about.” As a memory aid, subjects were encouraged to write a description of their choosing on a Post-It note for each pile, and place the note next to the corresponding pile.

2.  Having formed their own impression of the 6 topical categories, subjects in the control condition were now given 32 new randomly-selected cards in English. They were instructed that for each new card, they should decide in which of the 6 categories it “belongs” and place it next to the corresponding pile. They were also given the option of placing cards in a seventh “none of the above” category.

^2 All subjects in this experiment were employees of Sun Microsystems in Chelmsford, Massachusetts, solicited as volunteers. All were fluent in English, and none of those who saw Japanese materials was at all familiar with Japanese.

^3 As an additional piece of information, we had each subject write a short description of the topic for each pile, though those descriptions were not used in the study.


Subjects  were  told  to  take  as  long  as  they  liked  on 
both  parts  of  the  categorization  task,  though  Part  2 
was  timed  for  possible  future  use  of  that  information. 

Categorization Task: The Experimental Condition. A set of 8 subjects participated in the experimental condition. Part 1 of the experimental condition was completely identical to Part 1 of the control condition: subjects looked at exactly the same 6 piles of English cards and formed their own mental description of each topical category, writing down a short description as a memory aid.

Part 2 was also identical, with one crucial exception: instead of being given cards in English to place into categories, subjects were given the corresponding gisted Japanese cards.

Categorization Task: Random Baseline. In order to obtain a lower bound for performance on this task, the computer did 8 runs placing the gisted Japanese cards into the 7 categories at random. We also computed lower bounds with the computer making a forced choice, i.e. not allowing random selection to pick the “none of the above” category; the results differed negligibly.
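A random baseline of this kind takes only a few lines to generate. The sketch below is an assumption-laden illustration rather than the procedure actually used: it supposes that each run assigns the same 32 cards given to human subjects, that categories are numbered 0 to 6 with 6 standing for “none of the above”, and that assignment is uniform; the function name and the fixed seed are hypothetical.

```python
import random
from typing import List


def random_baseline(n_items: int = 32,
                    n_categories: int = 7,
                    n_runs: int = 8,
                    seed: int = 0) -> List[List[int]]:
    """Generate n_runs random categorizations of n_items cards.

    Each run is a list holding one category index per card, drawn
    uniformly from range(n_categories). Passing n_categories=6 gives
    the forced-choice variant without "none of the above".
    """
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    return [[rng.randrange(n_categories) for _ in range(n_items)]
            for _ in range(n_runs)]


runs = random_baseline()                 # 7 categories, 8 runs
forced = random_baseline(n_categories=6) # forced-choice variant
```

Each generated run can then be treated exactly like a human subject in the analysis that follows.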

Analysis 

The categorization data gathered in the experiment were analyzed following the method of Hripcsak et al. (Hripcsak et al. 1995). In their study, they compared the performance of physicians, laypersons, and several computer programs on the task of classifying chest radiograph reports according to the presence or absence of 6 medical conditions. Our adaptation of their analysis is almost completely direct, with subjects in the control condition (English cards) corresponding to the physicians, subjects in the experimental condition (gisted cards) corresponding to laypersons, and each run of our random baseline corresponding to a subject in their baseline conditions (simple keyword-based classification).

The basic idea in the analysis is to compute the “distance” between subjects on the basis of their categorization behavior, and to see whether the average distance between an experimental subject and the members of the control group is greater than the average distance of control group members from each other. We compute the distance $d_{ijk}$ between two subjects j and k for experimental item i as the number of topical categories where the subjects disagreed for this item, i.e. 0 if they placed item i into the same category and 2 if they did not (when the placements differ, the subjects disagree about both of the categories involved).^4 The overall distance from subject j to subject k is then just their average distance across all N items:

$$d_{jk} = \sum_{i=1}^{N} d_{ijk} / N \qquad (1)$$

^4 This distance measure was used because Hripcsak et al. included the more general case of allowing an item to be placed into multiple categories, i.e. in their case distance could range from 0 to 6.

Figure 4: Categorization of new items

The main figure of interest in this study is how much the categorization behavior of subjects in the experimental (gisted cards) condition differs from behavior of subjects in the control (English cards) condition. The average distance from a gisted-card subject k to the English-card subjects is

$$d_{k} = \sum_{j=1}^{J} d_{jk} / J \qquad (2)$$

where J is the number of English-card subjects. The corresponding average distance for English-card subjects is computed similarly, though naturally the averaging excludes the distance of each subject from himself or herself:

$$d_{k} = \sum_{j \neq k} d_{jk} / (J - 1) \qquad (3)$$

Hripcsak et al. also give a method for computing confidence intervals for these figures. In addition they point out that the analysis holds equally well for other inter-rater distance measures such as Cohen’s κ, though they comment that in their study Cohen’s κ and the above distance measure produced essentially the same results.
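The distance computations described in this section can be sketched as follows, assuming each subject’s behavior is recorded as a list with one category label per card, in a fixed card order. The function names and the toy data are hypothetical, and the confidence-interval computation of Hripcsak et al. is not reproduced here.

```python
from typing import Optional, Sequence


def pair_distance(a: Sequence[int], b: Sequence[int]) -> float:
    """Average per-item distance between two subjects.

    An item contributes 0 if both subjects chose the same category
    and 2 otherwise, so the result lies between 0 and 2.
    """
    if len(a) != len(b) or not a:
        raise ValueError("subjects must categorize the same non-empty item set")
    return sum(0 if x == y else 2 for x, y in zip(a, b)) / len(a)


def distance_to_controls(subject: Sequence[int],
                         controls: Sequence[Sequence[int]],
                         exclude: Optional[int] = None) -> float:
    """Average distance from one subject to the control group.

    For a subject who is a member of the control group, pass that
    subject's index as `exclude` so the distance of the subject from
    himself or herself is left out of the average.
    """
    others = [c for idx, c in enumerate(controls) if idx != exclude]
    return sum(pair_distance(c, subject) for c in others) / len(others)


# Toy data: three control subjects and one gisted-card subject, four cards.
controls = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 6, 5]]
gisted = [1, 2, 3, 4]
print(distance_to_controls(gisted, controls))                   # 0.5
print(distance_to_controls(controls[0], controls, exclude=0))   # 0.75
```

In the toy data the gisted-card subject differs from the three controls on 0, 1, and 2 of the 4 cards, giving pairwise distances of 0, 0.5, and 1.0 and an average of 0.5.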

Results. Figure 5 shows, for each subject, a point (with 95% confidence interval) representing that subject’s average distance from the judgments of the subjects in the English-card (control) condition. (Recall that distances range from 0 to 2.) As one should expect, the categorization behavior of subjects given degraded information (gisted cards) is far closer to the control group than random choice, but generally appears to differ from that of subjects in the control group, who were given full information in the form of English cards.

Figure 5: Left to right: English condition, Gisted condition, Random condition

We plan to replicate the study with a greater number of subjects, in order to better assess the significance of the variability that appears within the control group; in particular, whether the variance suggested by the comparatively greater distances for the 4th and 5th control subjects persists in a larger sample. In addition, it has been suggested that an additional, informative control in this experiment would be a group that performed the experiment using cards entirely in Japanese (for both the topical “piles” and the cards to be categorized); the materials for this condition are easily created, but our ability to perform the experiment will depend upon the availability of subjects who are fluent in Japanese.

Discussion 

Our central concern in this paper is not the method used for gisting, though of course that is also of interest, but rather the evaluation methodology we have designed. Were we to extend the gisting prototype, for example by improving dictionary coverage, adding automatic disambiguation, or manipulating word order, the value added by those changes could be measured simply and effectively by adding a condition to the above experiment in which subjects received cards with the putatively improved information. Similarly, anyone else’s method for conveying the content of Japanese Web pages (e.g. Temple (Vanni & Zajac to appear)) can be evaluated in terms of its value for gisting (i.e. decision support) simply by creating the corresponding materials from the same Japanese items we used to produce gisted cards in our experiment. If one method for producing gists is better than another, then subjects given that information should behave more like the “ideal” case (defined here by the behavior of subjects who receive information in English), as assessed quantitatively by the distance measure. Additional measures might also be brought into play, such as a comparison of the time it takes to make decisions given variant forms of information, or differences in the time-accuracy tradeoff that results when time limitations are imposed.

The evaluation methodology we have proposed generalizes easily to any number of other tasks that have similar characteristics, namely domains in which restricted or alternate-form information is used in support of decision-making because of limits on time, space, or user knowledge. Some examples:

•  In environments where text summarization is used to decide the disposition of full documents, e.g. routing of memoranda or scientific articles, this methodology could be used to evaluate the quality of summaries.

•  In environments where key elements are extracted from a stream of speech input, e.g. automatic monitoring of radio traffic, this methodology could be used to evaluate the extraction technology.

•  In environments where decisions are made on the basis of text-to-speech output, e.g. spoken language interfaces, this methodology could be used to evaluate the clarity of the speech synthesizer.

•  In environments where alternative versions of text or images can be presented, e.g. the selection of Web-based advertising based on client bandwidth, this methodology could be used to assess the impact of the advertisement format on users’ interest level.


We will be happy to make our experimental materials available to other researchers on request.